    TimeTrader: Exploiting Latency Tail to Save Datacenter Energy for On-line Data-Intensive Applications

    Datacenters running on-line, data-intensive applications (OLDIs) consume significant amounts of energy. However, reducing their energy is challenging due to their tight response time requirements. A key aspect of OLDIs is that each user query goes to all or many of the nodes in the cluster, so that the overall time budget is dictated by the tail of the replies' latency distribution; replies see latency variations in both the network and compute. Previous work proposes to achieve load-proportional energy by slowing down the computation at lower datacenter loads based directly on response times (i.e., at lower loads, the proposal exploits the average slack in the time budget provisioned for the peak load). In contrast, we propose TimeTrader to reduce energy by exploiting the latency slack in the sub-critical replies which arrive before the deadline (e.g., 80% of replies are 3-4x faster than the tail). This slack is present at all loads and subsumes the previous work's load-related slack. While the previous work shifts the leaves' response time distribution to consume the slack at lower loads, TimeTrader reshapes the distribution at all loads by slowing down individual sub-critical nodes without increasing missed deadlines. TimeTrader exploits slack in both the network and compute budgets. Further, TimeTrader leverages Earliest Deadline First scheduling to largely decouple critical requests from the queuing delays of sub-critical requests, which can then be slowed down without hurting critical requests. A combination of real-system measurements and at-scale simulations shows that, without adding to missed deadlines, TimeTrader saves 15-19% and 41-49% energy at 90% and 30% loading, respectively, in a datacenter with 512 nodes, whereas previous work saves 0% and 31-37%.
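
    The arithmetic behind the idea is simple: a reply that finishes well before the deadline can afford to run proportionally slower. The Python sketch below illustrates only that arithmetic; the function name, the guard band, and the assumption that latency scales roughly linearly with slowdown are illustrative choices, not TimeTrader's actual mechanism.

```python
# Minimal sketch (not TimeTrader's implementation): estimate how much each
# reply could be slowed without missing the overall deadline.
def slowdown_factors(reply_latencies_ms, deadline_ms, guard_band_ms=5.0):
    """Return a slowdown factor >= 1.0 per reply.

    Sub-critical replies (those finishing well before the deadline) can be
    slowed so they finish just inside the deadline minus a guard band;
    critical replies near the tail keep a factor of 1.0.
    """
    budget = deadline_ms - guard_band_ms
    factors = []
    for latency in reply_latencies_ms:
        if latency >= budget:
            factors.append(1.0)               # critical: do not slow down
        else:
            factors.append(budget / latency)  # assumes roughly linear scaling
    return factors

# Example: with a 300 ms budget, a reply arriving in 80 ms could in principle
# be slowed by ~3.7x without adding to missed deadlines.
print(slowdown_factors([80, 120, 290], deadline_ms=300))
```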

    Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization

    Modern ML applications increasingly rely on complex deep learning models and large datasets. There has been an exponential growth in the amount of computation needed to train the largest models. Therefore, to scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We define the quality of workers as reconstruction ratios in (0, 1], and formulate aggregation as a Maximum Likelihood Estimation procedure using Beta densities. We show that the regularized form of the log-likelihood with respect to the subspace can be approximately solved using an iterative least-squares solver, and provide convergence guarantees using recent convex optimization landscape results. Our empirical findings demonstrate that our approach significantly enhances the robustness of state-of-the-art Byzantine-resilient aggregators. We evaluate our method in a distributed setup with a parameter server, and show simultaneous improvements in communication efficiency and accuracy across various tasks. The code is publicly available at https://github.com/hamidralmasi/FlagAggregato
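
    To make the reconstruction-ratio idea concrete, here is a much-simplified Python sketch: score each worker's update by how much of it is captured by a low-rank subspace fitted to all updates, and aggregate with those scores as weights. The rank-1 SVD subspace, the names, and the plain weighted average are illustrative assumptions; they are not the paper's Flag-manifold formulation, Beta-density MLE, or iterative least-squares solver.

```python
# Hedged illustration only: subspace-based quality scores for worker updates.
import numpy as np

def robust_aggregate(worker_grads, rank=1):
    """worker_grads: list of 1-D arrays, one flattened gradient per worker."""
    G = np.stack(worker_grads)                   # shape (workers, params)
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    basis = vt[:rank]                            # rank-r subspace fitted to all updates
    proj = G @ basis.T @ basis                   # projection of each update onto it
    # Quality score: fraction of each update's norm captured by the subspace.
    ratios = np.linalg.norm(proj, axis=1) / (np.linalg.norm(G, axis=1) + 1e-12)
    weights = ratios / ratios.sum()
    return weights @ G                           # weighted average update

# Two well-behaved workers and one outlier ("Byzantine") update.
grads = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([-5.0, 8.0])]
print(robust_aggregate(grads))
```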

    Performance and energy optimizations for online, data-intensive (OLDI) applications and network packet classification

    The growth of Internet data coupled with the emergence of interactive Web applications poses unique challenges to the performance and energy-efficiency of datacenter and enterprise networks. As applications evolve from offline, throughput-oriented jobs to interactive Web applications, they demand low and predictable latency from the network. Further, because of their tight response time requirements, the interactive applications complicate energy management. On the other hand, as networks become large and functionally rich, they introduce newer challenges for packet classification. My research addresses performance (i.e., low-latency) and energy optimizations for a class of interactive Web applications, called Online Data-Intensive applications (OLDIs), and packet classification, and it spans from network hardware to transport protocols and applications. Fine-grained network management (e.g., software-defined networking) increases the number of flow rules, which complicates both lookups (i.e., search) and updates in packet classification at the switch hardware. I proposed EffiCuts and TreeCAM, which address lookups and updates, respectively. EffiCuts and TreeCAM effectively decouple lookups from updates and employ distinct data structures for each. EffiCuts reduces the memory overhead for lookups by two orders of magnitude. TreeCAM reduces the update effort by a factor of 30. Modern interactive Web applications like Web Search operate under soft-real-time constraints (e.g., 300 ms latency) which imply (1) deadlines for network communication and (2) that saving energy is hard. For predictable latency, I proposed Deadline-aware Datacenter TCP (D2TCP), a transport protocol that achieves deadline-based prioritization of network flows. I evaluated D2TCP in Google's datacenters and showed that D2TCP reduces missed deadlines by 50% as compared to existing transport protocols. For energy-efficiency, I proposed TimeTrader, which exploits slack in sub-critical replies (i.e., replies that do not fall in the tail of the latency distribution). TimeTrader exploits slack from both network and compute layers. TimeTrader achieves significant energy savings over existing schemes: 15% at peak load and 40% at 30% load.

    MigrantStore: Leveraging Virtual Memory in DRAM-PCM Memory Architecture

    With the imminent slowing down of DRAM scaling, Phase Change Memory (PCM) is emerging as a lead alternative for main memory technology. While PCM achieves low energy due to various technology-specific advantages, PCM is significantly slower than DRAM (especially for writes) and can endure far fewer writes before wearing out. Previous work has proposed to use a large, DRAM-based hardware cache to absorb writes and provide faster access. However, due to ineffectual caching where blocks are evicted before a sufficient number of accesses, hardware caches incur significant overheads in energy and bandwidth, two key but scarce resources in modern multicores. Because using hardware for detecting and removing such ineffectual caching would incur additional hardware cost and complexity, we leverage the OS virtual memory support for this purpose. We propose a DRAM-PCM hybrid memory architecture where the OS migrates pages on demand from the PCM to DRAM. We call the DRAM part of our memory MigrantStore, which includes two ideas. First, to reduce the energy, bandwidth, and wear overhead of ineffectual migrations, we propose migration hysteresis. Second, to reduce the software overhead of good replacement policies, we propose the recently-accessed-page-id (RAPid) buffer, a hardware buffer to track the addresses of recently-accessed MigrantStore pages.
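
    As a concrete, if toy, illustration of migration hysteresis, the Python sketch below migrates a page from PCM to DRAM only after it has been touched a threshold number of times, which is the filtering effect the abstract describes. The class, its fields, and the threshold are hypothetical; the real mechanism lives in the OS virtual memory system, not in application code.

```python
# Toy model of migration hysteresis (illustrative assumptions throughout).
from collections import defaultdict

class MigrantStoreSim:
    def __init__(self, threshold=4):
        self.threshold = threshold
        self.pcm_access_counts = defaultdict(int)  # accesses seen while a page is in PCM
        self.in_dram = set()                       # pages currently held in DRAM

    def access(self, page):
        if page in self.in_dram:
            return "dram_hit"
        self.pcm_access_counts[page] += 1
        if self.pcm_access_counts[page] >= self.threshold:
            self.in_dram.add(page)                 # the OS would remap the page here
            del self.pcm_access_counts[page]
            return "migrated_to_dram"
        return "pcm_access"                        # not hot enough to migrate yet

sim = MigrantStoreSim(threshold=3)
print([sim.access(0x42) for _ in range(5)])
# ['pcm_access', 'pcm_access', 'migrated_to_dram', 'dram_hit', 'dram_hit']
```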

    Deadline-Aware Datacenter TCP (D2TCP)

    An important class of datacenter applications, called Online Data-Intensive (OLDI) applications, includes Web search, online retail, and advertisement. To achieve good user experience, OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency) which imply deadlines for network communication within the applications. Further, OLDI applications typically employ tree-based algorithms which, in the common case, result in bursts of children-to-parent traffic with tight deadlines. Recent work on datacenter network protocols is either deadline-agnostic (DCTCP) or is deadline-aware (D3) but suffers under bursts due to race conditions. Further, D3 has the practical drawbacks of requiring changes to the switch hardware and not being able to coexist with legacy TCP. We propose Deadline-Aware Datacenter TCP (D2TCP), a novel transport protocol, which handles bursts, is deadline-aware, and is readily deployable. In designing D2TCP, we make two contributions: (1) D2TCP uses a distributed and reactive approach for bandwidth allocation which fundamentally enables D2TCP's properties. (2) D2TCP employs a novel congestion avoidance algorithm, which uses ECN feedback and deadlines to modulate the congestion window via a gamma-correction function. Using a small-scale implementation and at-scale simulations, we show that D2TCP reduces the fraction of missed deadlines compared to DCTCP and D3 by 75% and 50%, respectively.
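
    The gamma-correction idea can be sketched in a few lines of Python. Based on the description above, the congestion penalty is the ECN-derived fraction alpha raised to a deadline-imminence exponent d, so far-deadline flows back off more and near-deadline flows back off less. The clamp on d, the names, and the constants are assumptions for illustration; this is not a faithful D2TCP implementation.

```python
# Hedged sketch of deadline-aware window adjustment via gamma correction.
def next_cwnd(cwnd, alpha, d):
    """cwnd: current congestion window (segments).
    alpha: smoothed fraction of ECN-marked packets, in [0, 1] (as in DCTCP).
    d: deadline-imminence factor (> 1 means the deadline is near).
    """
    d = min(max(d, 0.5), 2.0)        # assumed clamp on the imminence factor
    p = alpha ** d                   # gamma-correct the congestion signal
    if p > 0:
        return cwnd * (1 - p / 2)    # congestion: back off in proportion to p
    return cwnd + 1                  # no congestion: additive increase

# Same ECN signal, different deadlines: the near-deadline flow (d = 2.0)
# shrinks its window less than the far-deadline flow (d = 0.5).
print(next_cwnd(100, alpha=0.4, d=2.0), next_cwnd(100, alpha=0.4, d=0.5))
```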

    EffiCuts: optimizing packet classification for memory and throughput

    Packet classification is a key functionality provided by modern routers. Previous decision-tree algorithms, HiCuts and HyperCuts, cut the multi-dimensional rule space to separate a classifier's rules. Despite their optimizations, the algorithms incur considerable memory overhead due to two issues: (1) Many rules in a classifier overlap and the overlapping rules vary vastly in size, causing the algorithms' fine cuts for separating the small rules to replicate the large rules. (2) Because a classifier's rule-space density varies significantly, the algorithms' equi-sized cuts for separating the dense parts needlessly partition the sparse parts, resulting in many ineffectual nodes that hold only a few rules. We propose EffiCuts, which employs four novel ideas: (1) Separable trees: To eliminate overlap among small and large rules, we separate all small and large rules. We define a subset of rules to be separable if all the rules are either small or large in each dimension. We build a distinct tree for each such subset where each dimension can be cut coarsely to separate the large rules, or finely to separate the small rules without incurring replication. (2) Selective tree merging: To reduce the multiple trees' extra accesses which degrade throughput, we selectively merge separable trees mixing rules that may be small or large in at most one dimension. (3) Equi-dense cuts: We employ unequal cuts which distribute a node's rules evenly among the children, avoiding ineffectual nodes at the cost of a small processing overhead in the tree traversal. (4) Node co-location: To achieve fewer accesses per node than HiCuts and HyperCuts, we co-locate parts of a node and its children. Using ClassBench, we show that for similar throughput EffiCuts needs a factor of 57 less memory than HyperCuts and a factor of 4 to 8 less power than TCAM.
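
    The separability idea in (1) can be illustrated with a short Python sketch: classify each rule as small or large in every dimension against a largeness threshold, then group rules by that signature so a distinct tree can be built per group. The threshold value, the encoding of fields as (lo, hi) ranges, and the function names are assumptions for illustration, not EffiCuts' actual heuristics.

```python
# Illustrative grouping of rules into separable subsets (assumptions only).
LARGENESS_THRESHOLD = 0.5   # assumed fraction; the paper tunes this per field

def signature(rule, field_widths):
    """rule: list of (lo, hi) ranges, one per field; True means 'large' in that field."""
    return tuple(
        (hi - lo + 1) / width > LARGENESS_THRESHOLD
        for (lo, hi), width in zip(rule, field_widths)
    )

def group_separable(rules, field_widths):
    groups = {}
    for rule in rules:
        groups.setdefault(signature(rule, field_widths), []).append(rule)
    return groups   # build one decision tree per group to avoid rule replication

# Two-field example (e.g., two 8-bit header fields): one wildcard-like rule,
# one fully specific rule; they land in different separable groups.
rules = [[(0, 255), (10, 20)], [(5, 5), (7, 7)]]
print(list(group_separable(rules, field_widths=[256, 256])))
```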